Abstract

The density deconvolution problem involves recovering a target density 
g from a sample that has been corrupted by noise. From the perspective of 
Le Cam’s local asymptotic normality theory, we show that non-parametric 
density deconvolution with Gaussian noise behaves similarly to a low-dimensional
parametric problem that can easily be solved by maximum
likelihood. This framework allows us to give a simple account of the 
statistical efficiency of density deconvolution and to concisely describe 
the effect of Gaussian noise on our ability to estimate g, all while relying 
on classical maximum likelihood theory instead of the kernel estimators 
typically used to study density deconvolution. 

Keywords. Adaptive minimaxity, Gaussian sequence model, local asymptotic
normality, relative efficiency.


1 Introduction 

Suppose that we observe $n$ samples $X_i$ drawn from a hierarchical model

$\mu_i \sim g(\cdot), \quad X_i = \mu_i + \varepsilon_i, \quad \text{with } \varepsilon_i \sim \mathcal{N}(0, 1),$ (1)

and our goal is to estimate the unknown density $g(\cdot)$. This problem, sometimes
called the density deconvolution problem, is remarkably hard in terms of asymptotic
statistical criteria. For example, as shown by Carroll and Hall [1988] and
Fan [1991], if we assume a non-parametric setup where $g$ is only known to have a
Lipschitz-continuous $k$-th derivative, then the minimax error rate for estimating
$g$ under the integrated squared error loss decays as $\log(n)^{-(k+1)}$.
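To make the sampling model (1) concrete, the following minimal sketch draws clean values $\mu_i$ and their noised counterparts $X_i$. The two-component mixture used for $g$ and the names `sample_noisy` and `draw_mixture` are illustrative assumptions, not part of the analysis above.

```python
import random
import statistics

def sample_noisy(n, draw_mu, seed=0):
    """Draw n pairs (mu_i, X_i) with X_i = mu_i + eps_i and eps_i ~ N(0, 1)."""
    rng = random.Random(seed)
    mus = [draw_mu(rng) for _ in range(n)]
    xs = [mu + rng.gauss(0.0, 1.0) for mu in mus]
    return mus, xs

def draw_mixture(rng):
    # Hypothetical target density g: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
    center = -2.0 if rng.random() < 0.5 else 2.0
    return rng.gauss(center, 0.5)

mus, xs = sample_noisy(20000, draw_mixture)
var_mu = statistics.pvariance(mus)
var_x = statistics.pvariance(xs)
# The unit-variance noise adds roughly 1 to the variance of the clean draws.
print(f"Var[mu] = {var_mu:.3f}, Var[X] = {var_x:.3f}")
```

Only the `xs` would be available to the statistician; the `mus` are kept here purely to show how much the noise blurs the sample.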

The density deconvolution problem is traditionally studied using kernel methods.
The motivation for kernel estimators is that they often achieve adaptive
minimaxity over natural regularity classes for $g$; for example, they in fact achieve
the optimal rates of Carroll and Hall [1988] and Fan [1991] described above.
The properties of kernel density deconvolution have been analyzed by several
authors, including Butucea and Comte [2009], Carroll and Hall [2004], Comte
and Lacour [2011], Efromovich [1997], Fan and Koo [2002], Hall and Lahiri
[2008], Hall and Meister [2007], Stefanski and Carroll [1990], Wand [1998], and
Zhang [1990].

*I am deeply grateful to Brad Efron for many enlightening conversations as well as his
continual encouragement, and to Dave Donoho and Will Fithian for several helpful comments
and suggestions. This work was supported by a B. C. and E. J. Eaves Stanford Graduate
Fellowship.

Despite the prolific literature devoted to them, however, kernel methods do
not necessarily yield a fully satisfying theory of the statistics of density
deconvolution. In particular, Efron [2014a,b] proposed a simple maximum likelihood
approach to density deconvolution that performs qualitatively better than
kernel methods on several realistic scientific tasks. Thus, it appears that while
kernel methods are nearly optimal in terms of the standard asymptotic criteria 
used in the literature, they are not always optimal in practical applications. 
This suggests a need for a new optimality theory for density deconvolution. 

The goal of this paper is to move us towards such a theory. We begin with a 
close analysis of Efron’s method, which involves estimating the unknown density 
g by maximum likelihood with a p-parameter model 

$g_\eta(\mu) = g_0(\mu) \exp[\eta \cdot T(\mu) - \psi(\eta)],$ (2)

where $T(\mu)$ is some carefully chosen $p$-parameter statistic and $\psi(\cdot)$ is the
log-partition function. Given that Efron’s method involves parametric maximum
likelihood estimation, it may appear surprising that this method would work well 
in a non-parametric setup where we only know that the target density g belongs 
to some regularity class. However, we show that maximum likelihood estimation 
in the model (2) with an appropriate choice of T has adaptive minimaxity 
properties that are reminiscent of those enjoyed by kernel estimators. Moreover, 
because Efron’s method relies on maximum likelihood instead of the ad-hoc 
kernel inversion procedure used by classical methods, we may also hope for the 
method to be well-behaved in a wide variety of practical applications. 

We build our analysis around a relative efficiency criterion, which measures
the information loss for estimation in the model (2) we incur from the noise $\varepsilon$
in (1). Our main result is that, given a specified carrier $g_0$, there exists a
low-dimensional statistic $T$ that captures essentially all the information contained in
the $X$-sample for estimating local perturbations to $g_0$. Any model of the form
(2) that tries to use a higher-dimensional parametrization will have some bad
contrasts that could have been accurately estimated from clean observations
$\mu_i$, but become effectively impossible to estimate from the noised observations
$X_i$. If $g_0$ is Gaussian, the number of samples needed to accurately estimate a
$p$-parameter family of the form (2) scales exponentially in $p$, and this bound
holds uniformly over any possible choice of the statistic $T$.

At face value, our results can be interpreted as a companion to the classical
hardness results of Carroll and Hall [1988] and Fan [1991]: our analysis implies
that it is impossible to accurately estimate any parametric family of the form
(2) in terms of relative efficiency if $p$ is even moderately large. Thus, no matter
how clever we may be, we cannot use Efron’s density deconvolution model to
efficiently learn a rich model for $g$.

However, from a decision-theoretic point of view, our result can also be 
interpreted in a more optimistic light. Because high-dimensional models of the 
form (2) are effectively impossible to estimate, we lose almost nothing by just 
using a low dimensional model. Thus, the non-parametric density deconvolution 
problem effectively reduces to parametric inference for Efron’s model (2) with 
an appropriate low-dimensional parametrization T. 

Our results take on a particularly simple form when the carrier is Gaussian.
In this case, the “optimal” statistics are polynomials in $\mu$, yielding the
class of estimators

$\hat{g}(\mu) = \exp\left[\sum_{j=1}^{p} \hat{\eta}_j \mu^j - \psi(\hat{\eta})\right],$ (3)


where $p$ is a tuning parameter. As we will show, given an appropriate (usually
small) choice of $p$, parametric inference for this deceptively simple model
is nearly equivalent to optimal non-parametric inference for $g$.^1 More specifically,
this class of estimators is asymptotically nearly adaptively minimax under
Kullback-Leibler (KL) loss for estimating local perturbations of $g_0$,

$g^{(n)}(\mu) = g_0(\mu) \exp\left[\frac{1}{\sqrt{n}} \tau(\mu) - \psi_\tau^{(n)}\right],$ (4)

where $\tau(\cdot)$ is a tilting function satisfying regularity conditions detailed in Section
3. Moreover, the efficiency shortfall of the best estimator of the form (3) relative
to the minimax estimator can be bounded by a small explicit constant that is,
for typical parameter values, on the order of 2.

In summary, this paper introduces a relative efficiency criterion and a local 
perturbation model that, together, enable us to shed new light on the classical 
problem of density deconvolution. We show that the exponential family method
(2) cannot get around standard non-parametric impossibility results and, in
particular, does not allow us to estimate rich models for $g(\cdot)$. However, the method
does an excellent job at extracting all the available information from the
$X$-sample, all while allowing for straightforward estimation and inference. Thus,
despite its simple form, Efron’s parametric density deconvolution method yields 
simple estimators whose excellent practical performance is firmly grounded in 
asymptotic minimaxity theory. 

^1 In all our experiments, we use $p = 4$. In practice, $p$ could also be selected by
cross-validation.
1.1 Theoretical Setup: Local Deconvolution

Throughout this paper, we assume that we observe $n$ samples $X_i$ drawn from a
hierarchical model

$\mu_i \sim g(\cdot), \quad X_i = \mu_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, 1);$ thus, (5)

$X_i \sim f(\cdot), \quad f(x) = (\varphi * g)(x),$ (6)

where $f$ denotes the marginal density of the observations $X$ generated according
to the model (1) and $\varphi$ is the standard Gaussian density. The setup induced by
(5) and (6) is classical; however, the analysis techniques we use to understand 
this model are rather different from the ones used by, e.g., Carroll and Hall 
[1988], Efromovich [1997], or Fan [1991]. 

Unlike other authors who model $g(\cdot)$ as a fixed member of a regularity class
$\mathcal{R}$ that can be estimated with increasing accuracy as the sample size $n$ grows to
infinity, we study a sequence of ever-shrinking perturbations of a known carrier
density $g_0$:

$g^{(n)}(\mu) = g_0(\mu) \exp\left[\frac{1}{\sqrt{n}} \tau(\mu) - \psi_\tau^{(n)}\right],$ (7)

where $\tau(\cdot)$ belongs to an appropriate regularity ellipsoid. Notice that, as $n$ gets
large, estimating $\tau$ does not necessarily get easier because the deviance between
$g_0$ and $g^{(n)}$ decays as $1/\sqrt{n}$. As we will show formally in Section 3, the problem
of estimating $g^{(n)}$ in (7) under a re-scaled deviance loss


$L_n(\hat{g}^{(n)}) = n D_{KL}(g^{(n)}, \hat{g}^{(n)}) = n \int g^{(n)}(\mu) \log\left(\frac{g^{(n)}(\mu)}{\hat{g}^{(n)}(\mu)}\right) d\mu$ (8)

converges to a locally asymptotically normal experiment in the sense of Le Cam
[1960]. The formalism provided by the sequence of problems (7) thus enables
us to study density deconvolution from the perspective of classical maximum 
likelihood asymptotics. 
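As a small numerical illustration of why the $n$-fold rescaling of the loss is natural, the sketch below evaluates $n D_{KL}(g^{(n)}, g_0)$, i.e., the re-scaled loss of the trivial estimate that simply returns the carrier. The choices $\tau(\mu) = \mu$, a $\mathcal{N}(0, \sigma^2)$ carrier on a truncated grid, and the function name are assumptions made only for this example. In this special case $g^{(n)}$ is a Gaussian with mean $\sigma^2/\sqrt{n}$, so the re-scaled loss equals $\sigma^2/2$ for every $n$.

```python
import math

def rescaled_kl_to_carrier(n, sigma=1.0, tau=lambda mu: mu, steps=4000):
    """n * KL(g_n || g_0) for g_n(mu) proportional to g_0(mu) exp(tau(mu)/sqrt(n))."""
    half_width = 12.0 * sigma                      # truncation of the real line
    h = 2.0 * half_width / steps
    grid = [-half_width + (i + 0.5) * h for i in range(steps)]
    g0 = [math.exp(-0.5 * (m / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
          for m in grid]
    tilt = [tau(m) / math.sqrt(n) for m in grid]
    # Log-partition term: normalizes the tilted density on the grid.
    psi = math.log(sum(g * math.exp(t) for g, t in zip(g0, tilt)) * h)
    gn = [g * math.exp(t - psi) for g, t in zip(g0, tilt)]
    # log(g_n / g_0) = tilt - psi, so KL is the g_n-average of (tilt - psi).
    kl = sum(g * (t - psi) for g, t in zip(gn, tilt)) * h
    return n * kl

for n in (10, 1000, 100000):
    print(n, round(rescaled_kl_to_carrier(n), 4))
```

The re-scaled loss of ignoring the perturbation stays of constant order as $n$ grows, which is the sense in which the sequence of problems does not become easier.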

Although the estimation problem in the local perturbation model (7) may
look very different from the classical estimation problem with $g \in \mathcal{R}$ for some
fixed regularity set $\mathcal{R}$, it appears that studying the former can help us learn
about the latter. First of all, we find a universality phenomenon, where maximum
likelihood estimation in the model (3) is simultaneously nearly minimax
for any Gaussian carrier $g_0 = \varphi_\sigma$, and so we can carry out practical data
analysis with a simple default model for $g$. Second, our analysis recovers familiar
qualitative aspects of the theory of Carroll and Hall [1988] and Fan [1991], such
as exponential blow-ups in the number of samples required to estimate more
complex models; see, e.g., Theorem 5. Thus, insofar as maximum-likelihood
estimation in our local perturbation model is amenable to exact asymptotic 
analysis, it appears that the model (7) is a useful theoretical tool for gaining 
new insights about the density deconvolution problem. 

We begin our analysis by developing a theory of relative efficiency for density 
deconvolution in Section 2, with the goal of establishing lower bounds for the 
error rate of any p-parameter model of the form (2). In Section 3, we complement
this worst-case picture by establishing near-minimax properties of the estimator 
(3) for estimating local perturbations of Gaussian densities. Our proof technique 
is built on the local asymptotic normality theory of Le Cam [1960] combined 
with Pinsker’s theorem for the Gaussian sequence model. Finally, in Section 4, 
we discuss practical differences between kernel density deconvolution and Efron’s 
method, and the respective optimality theories that justify each method. All 
proofs are provided in the appendix. 

Our minimaxity proof, which first reduces a continuous problem to a Gaussian
sequence model and then applies Pinsker’s theorem, fits into a rich literature
on solving non-parametric problems via Gaussian estimation; see, e.g., Brown
et al. [2004], Brown and Low [1996], Efromovich and Samarov [1996], Golubev
et al. [2010], Johnstone [2011], and Nussbaum [1996]. The problem of density
estimation using a log-spline model for $g$ of the form (2) has also been considered
by Koo [1999] and Koo and Chung [1998]; however, their approach more closely
follows the traditional ideas of Fan [1991] and others. Finally, we note that the
non-parametric maximum likelihood problem of estimating $f = \varphi * g$ has been
studied by, among others, Jiang and Zhang [2009], Koenker and Mizera [2014],
and Zhang [2009]. Efron’s method also induces a natural estimator $\hat{f} = \varphi * \hat{g}$,
but we do not study its properties here.

Remark: Cyclic Convolution To avoid technical difficulties relating to
integrability over unbounded domains, we follow convention and study a cyclic
convolution model [e.g., Efromovich, 1997]. More specifically, we assume that
our observations are within a bounded interval $\Omega_M = [-M, M]$, and that we
have a cyclic convolution operator

$K_M : \Omega_M \times \Omega_M \to \mathbb{R}_+, \quad K_M(\mu, x) = \sum_{j=-\infty}^{\infty} \varphi(x - \mu + 2jM).$ (9)

We also frequently use the cyclically wrapped-around Gaussian density with
variance $\sigma^2$, i.e., $\varphi_\sigma(x) = \sum_{j=-\infty}^{\infty} \sigma^{-1} \varphi((x + 2jM)/\sigma)$ with $x \in \Omega_M$, as the
carrier $g_0$. In our results, $M$ should be thought of as large enough that $K_M(\mu, x) \approx
\varphi(x - \mu)$ over the range of the data; formally, we will seek to state results in a
limit with $M \to \infty$.
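The cyclic kernel in (9) can be checked numerically. The sketch below truncates the infinite sum to a finite number of wraps (an assumption made for computation; the helper names are hypothetical) and verifies two properties used informally above: the wrapped kernel still integrates to one over $[-M, M]$, and for moderately large $M$ it is essentially indistinguishable from the ordinary Gaussian kernel away from the boundary.

```python
import math

def phi(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cyclic_kernel(mu, x, M, terms=20):
    """Truncated version of K_M(mu, x) = sum_j phi(x - mu + 2 j M)."""
    return sum(phi(x - mu + 2.0 * j * M) for j in range(-terms, terms + 1))

def integrate_over_x(mu, M, steps=4000):
    """Midpoint-rule integral of K_M(mu, x) over x in [-M, M]."""
    h = 2.0 * M / steps
    return sum(cyclic_kernel(mu, -M + (i + 0.5) * h, M) for i in range(steps)) * h

M = 6.0
mass = integrate_over_x(mu=5.5, M=M)                 # mu close to the boundary
gap = abs(cyclic_kernel(0.0, 1.0, M) - phi(1.0))     # interior point
print(f"total mass = {mass:.8f}, interior gap = {gap:.2e}")
```

Mass that would leave the interval on one side re-enters on the other, which is why the total stays at one even when $\mu$ sits near the edge.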


2 Relative Efficiency and Most Stable Families 

The observations $X_i$ generated by the model (1) are noisy measurements of
clean draws $\mu_i$ from $g$. What is the information loss due to this extra noise?
Or, in other words, if we can estimate $\eta$ to a given accuracy using samples $\mu$
drawn directly from $g$, how many samples $n_X$ would we need to estimate $\eta$ to
the same accuracy using only $X$ samples? Although this question may not at
first appear directly related to our topic of interest, answering it will prove to be
helpful in guiding us towards good choices for the statistic $T$ in Efron’s model
(2) and in proving lower bounds for the error rate of any parametric model.

In Section 2.1 below, we define a relative efficiency coefficient that will let 
us make the above question precise. Using this formalism, we then derive the 
“best” families of the form (2) in terms of relative efficiency in Section 2.2. 
Finally, in Sections 2.3 and 2.4 we interpret these relative efficiency results for
Gaussian and non-Gaussian carriers $g_0$, respectively.


2.1 The Relative Efficiency Coefficient 

If we had access to samples $\mu$ drawn directly from a density $g_\eta$ of the form
(2), then under standard regularity conditions maximum likelihood estimation
would yield an asymptotically normal estimator with

$\sqrt{n}(\hat{\eta}_\mu - \eta) \Rightarrow \mathcal{N}(0, I_\mu^{-1}(\eta)), \quad I_\mu(\eta) = -E_\eta\left[\frac{\partial^2}{\partial \eta^2} \log g_\eta(\mu)\right],$ (10)

where $I_\mu(\eta)$ is the Fisher information for estimating $\eta$ carried by the $\mu$-samples.
In our setup, however, we only have access to samples $X$ drawn from (6); the
Fisher information for the maximum-likelihood estimator $\hat{\eta}_X$ is then reduced to

$I_X(\eta) = -E_\eta\left[\frac{\partial^2}{\partial \eta^2} \log f_\eta(X)\right].$ (11)

Given this setup, we define the efficiency of $\hat{\eta}_X$ relative to $\hat{\eta}_\mu$ as

$\rho_\eta(T) = \inf_{a \in \mathbb{R}^p, \, a \neq 0} \left\{ \frac{a^\top I_X(\eta) \, a}{a^\top I_\mu(\eta) \, a} \right\},$ (12)

and argue that the relative efficiency coefficient $\rho$ provides a natural measure of
information loss due to the noise $\varepsilon$ in (1).

If we wanted to measure relative efficiency in a univariate family (2) with
tilting function $\tau : \mathbb{R} \to \mathbb{R}$, then the natural measure of relative efficiency is

$\rho_\eta(\tau) = \lim_{n \to \infty} \frac{Var[\hat{\eta}_\mu]}{Var[\hat{\eta}_X]} = \frac{I_X(\eta)}{I_\mu(\eta)},$ (13)

which measures exactly the increase in sample size required for accurate
estimation using the $X$-sample instead of the $\mu$-sample. As we can easily verify,
the multivariate relative information coefficient (12) is simply the worst-case
relative efficiency for any univariate subfamily of (2):

$\rho_\eta(T) = \inf \{\rho_\eta(\tau) : \tau(\mu) = a \cdot T(\mu), \ a \in \mathbb{R}^p\}.$ (14)

This connection provides a first motivation for the definition (12). The
coefficient $\rho_\eta$ can also be viewed as a natural extension to the classical E-optimal
criterion for experimental design [Ehrenfeld, 1955]: if $T$ is scaled such that
$I_\mu = I_{p \times p}$, then $\rho_\eta(T)$ is the minimum eigenvalue of $I_X$, which is exactly the
criterion that the E-optimal designs maximize.
The following result derives a simple functional form for the relative efficiency
coefficient. We note a striking similarity between the formula (15) and
the results of Louis [1982] for maximum likelihood estimation with missing data.
This similarity is not an accident: we could also interpret the density deconvolution
problem as one where we would have wanted to observe $\mu = X - \varepsilon$,
but $\varepsilon$ is missing.

We also establish a key theoretical property of $\rho$, namely that it is
transformation invariant: if we apply any invertible linear transformation $Q$ to $T$, the
value of $\rho$ remains unchanged. This transformation invariance provides further
evidence that $\rho$ is a “natural” measure for understanding the difficulty of density
deconvolution. We note in particular that the Fisher information $I_X$ is not
transformation invariant.


Lemma 1. The relative efficiency coefficient (12) is transformation invariant,
in the sense that $\rho_\eta(T) = \rho_\eta(QT)$ for any invertible linear transformation $Q$.
For a univariate statistic, the relative efficiency coefficient can be written as

$\rho_\eta(\tau) = \frac{Var_\eta\left[E[\tau(\mu) \mid X]\right]}{Var_\eta[\tau(\mu)]}.$ (15)
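The variance-ratio formula (15) is easy to check by simulation in the simplest case. The sketch below assumes a $\mathcal{N}(0, \sigma^2)$ carrier and the linear statistic $\tau(\mu) = \mu$ (both illustrative choices), for which the conditional mean is available in closed form, $E[\mu \mid X] = \sigma^2 X / (1 + \sigma^2)$, so that (15) should equal $(1 + \sigma^{-2})^{-1}$. The helper names are hypothetical.

```python
import random

def variance(values):
    """Population variance, computed directly to keep the example fast."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def rho_linear_statistic(sigma, n=100000, seed=1):
    """Monte Carlo estimate of (15) for tau(mu) = mu and mu ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    shrink = sigma ** 2 / (1.0 + sigma ** 2)
    mus, cond_means = [], []
    for _ in range(n):
        mu = rng.gauss(0.0, sigma)
        x = mu + rng.gauss(0.0, 1.0)
        mus.append(mu)
        # Closed-form posterior mean E[mu | X = x] under the Gaussian carrier.
        cond_means.append(shrink * x)
    return variance(cond_means) / variance(mus)

for sigma in (0.5, 1.0, 2.0):
    estimate = rho_linear_statistic(sigma)
    exact = 1.0 / (1.0 + sigma ** -2)
    print(f"sigma = {sigma}: simulated {estimate:.3f}, closed form {exact:.3f}")
```

A narrow carrier (small $\sigma$) is hit hardest: most of what the $X$-sample shows is then noise rather than signal, and the ratio drops toward zero.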


2.2 Deriving the Most Stable p-dimensional Family 

Given our notion of relative efficiency defined above, it is natural to ask whether 
there exist optimal p-dimensional statistics T in terms of this criterion. Perhaps 
surprisingly, we will show that not only do such optimal statistics exist, but 
they are in general quite easy to compute and can give us guidance for practical 
data analysis. Formally, we define the optimal statistics as solutions to the 
optimization problem 


$\Gamma_p(g_0) = \operatorname{argmax}_{T : \mathbb{R} \to \mathbb{R}^p} \{\rho_0(T)\},$ (16)

where $g_\eta(\mu) = g_0(\mu) \exp[\eta \cdot T(\mu) - \psi(\eta)]$ is defined in terms of some known
carrier. Notice that our definition of $\Gamma$ in terms of relative efficiency at $\eta = 0$ is
without loss of generality, since we could always just use $g_\eta(\mu)$ as our carrier if
this condition did not hold.

In order to solve the problem (16), we begin with a technical result for
computing the multivariate relative efficiency coefficient. In this section, we
will assume that the $X$ and $\mu$ are distributed over a compact interval $\Omega$, and
that $X$ is noised by a generic convolution operator $K(\mu, x)$.

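Lemma 2. Let $T$ be a $p$-dimensional statistic; let $g(\cdot) = g_\eta(\cdot)$ be defined as in
(2) and let $f$ be the marginal density of the observations $X$. Then

$\rho_\eta(T) = \rho_\eta(a^* \cdot T),$ with (17)

$a^* = \operatorname{argmin}_{\|a\|_2 = 1} \left\{ \frac{a^\top \left[\int_\Omega T(\mu) T^\top(\mu) K^2(x, \mu) g^2(\mu) f^{-1}(x) \, dx \, d\mu\right] a}{a^\top \left[\int_\Omega T(\mu) T^\top(\mu) g(\mu) \, d\mu\right] a} \right\}.$ (18)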
We note that, in practice, the optimization problem described in (18)
can be efficiently solved by taking $a^* = Q_T b^* / \|Q_T b^*\|_2$, where $b^*$ is the
eigenvector corresponding to the smallest eigenvalue of $Q_T^\top M_T Q_T$, and the matrices
$M_T$ and $Q_T$ are defined by

$M_T = \int_\Omega T(\mu) T^\top(\mu) K^2(x, \mu) g^2(\mu) f^{-1}(x) \, dx \, d\mu,$ (19)

$Q_T^\top \left[\int_\Omega T(\mu) T^\top(\mu) g(\mu) \, d\mu\right] Q_T = I_{p \times p}.$ (20)

With this result in hand, we are now ready to find the most favorable statistic
$\Gamma_p(g_0)$. Our construction hinges around eigenfunctions of the linear operator

$P_g(\mu_1, \mu_2) = \int_\Omega \sqrt{g(\mu_1)} \, K(\mu_1, x) f^{-1}(x) K(x, \mu_2) \sqrt{g(\mu_2)} \, dx,$ (21)

where $f$ is as usual defined as $f = \varphi * g$. We can verify that the top eigenfunction
of $P_g$ is given by $\sqrt{g(\cdot)}$ and has eigenvalue 1; the $p$ subsequent eigenvectors then
generate the most favorable $p$-dimensional family for density deconvolution.

Theorem 3. Suppose that both the carrier density $g_0 : \Omega \to \mathbb{R}_+$ and the
kernel $K : \Omega \times \Omega \to \mathbb{R}_+$ are continuous and bounded away from 0, and that $P_{g_0}$ as
defined in (21) is a compact operator over the space $L_2(\Omega)$ of square-integrable
functions over $\Omega$. Then, $P_{g_0}$ admits a spectral decomposition $\zeta_1, \zeta_2, \ldots$, where
the first eigenfunction $\zeta_1(\mu) = \sqrt{g_0(\mu)}$ has an eigenvalue 1. Moreover, the
$p$-dimensional exponential family of the form (2) with statistics

$T_k(\mu) = \frac{\zeta_{k+1}(\mu)}{\sqrt{g_0(\mu)}}, \quad k = 1, \ldots, p,$ (22)

is a most favorable $p$-dimensional family in the sense of (16), and the relative
efficiency coefficient corresponds to the $(p+1)$-st eigenvalue of $P_{g_0}$. Finally, if the
spectrum of $P_{g_0}$ does not have repeated eigenvalues, the most favorable family
$\Gamma_p(g_0)$ is unique up to scaling and rotation.

In other words, Theorem 3 establishes a direct link between the difficulty of
learning rich perturbation models around $g_0$, and the decay rate of the spectrum
of the linear operator $P_{g_0}$: the best-case efficiency for learning a $p$-parameter
model depends on the $p$-th non-trivial eigenvalue of $P_{g_0}$.

As a corollary to this result, we also see that if the relative information for
learning the most favorable $p$-parameter family is small, then the model (2)
with statistics $\Gamma_p(g_0)$ provides a good $X$-space approximation to any local
perturbation of $g_0$. Recall that $D_{KL}(f, f')$ effectively measures the power of the
$X$-sample likelihood-ratio test for distinguishing $f$ from $f'$; thus, the result
below implies that if $\lambda_{p+2}(P_{g_0})$ is small, it is statistically impossible to detect any
deviations from the most favorable $p$-parameter family using only $X$-samples.


Corollary 4. Suppose we have a data-generating function of the form

$\mu_1, \ldots, \mu_n \sim g_\tau^{(n)}(\mu) = g_0(\mu) \exp\left[\frac{1}{\sqrt{n}} \tau(\mu) - \psi_\tau^{(n)}\right]$ (23)

for some square-integrable function $\tau$ satisfying $\int \tau^2(\mu) g_0(\mu) \, d\mu \le C^2$, and
write $f_\tau^{(n)} = \varphi * g_\tau^{(n)}$. Then, under the conditions of Theorem 3, there exists
another tilting function $\tau^{(p,*)}$ in the span of the most-favorable $p$-dimensional
family $\Gamma_p(g_0)$ that can closely approximate $f_\tau^{(n)}$:

$\lim_{n \to \infty} n D_{KL}\left(f_\tau^{(n)}, f_{\tau^{(p,*)}}^{(n)}\right) \le C^2 \lambda_{p+2}(P_{g_0}),$ (24)

where $\lambda_{p+2}(P_{g_0})$ denotes the $(p+2)$-nd eigenvalue of the linear operator defined
in (21).

2.3 The Hardness of Local Deconvolution near Gaussian Carriers

The relative efficiency bound given in Theorem 3 is quite general, but
understanding its implications for practical data analysis may not be trivial at first
glance. If we are willing to take the carrier density $g_0$ to be Gaussian with
variance $\sigma^2$, however, the quantities defined in the statement of Theorem 3 take
on simple and interpretable forms. In the result below, we assume that we
are doing cyclic deconvolution over the domain $\Omega_M = [-M, M]$ for some large
$M$; this lets us state a clean result while avoiding integrability concerns over
unbounded domains.

Theorem 5. Suppose that the carrier $g_0(\mu) = \varphi_\sigma(\mu)$ is the Gaussian with
variance $\sigma^2$ wrapped over the interval $\Omega_M = [-M, M]$, and that we are in
the cyclic convolution setup over $\Omega_M$ described in Section 1.1. Then, for any
$p$-parameter statistic $T$,

$\rho_0(T) \le (1 + \sigma^{-2})^{-p} + o_M(1),$ (25)

where the residual term $o_M(1)$ becomes negligible as $M$ gets large. Moreover, this
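bound is satisfied by the model (2) whose statistics are the normalized Hermite
polynomials

$T_k(\mu) = h_k(\mu / \sigma), \quad k = 1, \ldots, p,$ (26)

with $h_k$ the $k$-th normalized Hermite polynomial, for which $I_\mu = I_{p \times p}$, and $I_X$ is
diagonal with entries $(1 + \sigma^{-2})^{-1}, \ldots, (1 + \sigma^{-2})^{-p}$.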
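The geometric decay of the information carried by successive Hermite statistics can be verified numerically through the variance-ratio formula (15). The sketch below is an illustration under stated assumptions: it works on the whole real line rather than the wrapped interval, takes $h_k = He_k / \sqrt{k!}$ with $He_k$ the probabilists' Hermite polynomials as the normalization, and uses a simple midpoint quadrature; the function names are hypothetical. Under the $\mathcal{N}(0, \sigma^2)$ carrier, $\mu \mid X = x$ is Gaussian with mean $\sigma^2 x / (1 + \sigma^2)$ and variance $\sigma^2 / (1 + \sigma^2)$.

```python
import math

def hermite_normalized(k, z):
    """h_k(z) = He_k(z) / sqrt(k!), with He_k the probabilists' Hermite polynomial."""
    if k == 0:
        return 1.0
    prev, cur = 1.0, z
    for j in range(1, k):
        # Three-term recurrence: He_{j+1}(z) = z He_j(z) - j He_{j-1}(z).
        prev, cur = cur, z * cur - j * prev
    return cur / math.sqrt(math.factorial(k))

def gauss_grid(half_width=8.0, steps=200):
    """Midpoint nodes and standard-Gaussian quadrature weights."""
    h = 2.0 * half_width / steps
    pts = [-half_width + (i + 0.5) * h for i in range(steps)]
    wts = [h * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in pts]
    return pts, wts

def rho_hermite(k, sigma):
    """Var[E[h_k(mu/sigma) | X]] / Var[h_k(mu/sigma)] for mu ~ N(0, sigma^2), k >= 1."""
    s = sigma ** 2 / (1.0 + sigma ** 2)       # shrinkage factor and posterior variance
    pts, wts = gauss_grid()
    total = 0.0
    for u, wu in zip(pts, wts):
        x = math.sqrt(1.0 + sigma ** 2) * u   # X ~ N(0, 1 + sigma^2)
        cond = sum(wz * hermite_normalized(k, (s * x + math.sqrt(s) * z) / sigma)
                   for z, wz in zip(pts, wts))
        total += wu * cond ** 2
    return total   # the denominator Var[h_k(mu/sigma)] equals 1

for k in range(1, 5):
    print(k, round(rho_hermite(k, 1.0), 6), (1.0 + 1.0 ** -2) ** (-k))
```

With $\sigma = 1$ each additional polynomial degree halves the retained information, so the fourth statistic already keeps only one sixteenth of it.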

In other words, Theorem 5 implies that, if we are trying to solve the problem
(2) with a $p$-dimensional statistic $T$ scaled to have $I_\mu(0) = I_{p \times p}$, then we
are necessarily faced with a difficult 1-dimensional sub-family with information
bounded by $(1 + \sigma^{-2})^{-p}$. In particular, it is impossible to accurately
distinguish local alternatives of 0 for $\eta$ with fewer than $n \approx (1 + \sigma^{-2})^{p}$ observations,
i.e., regardless of our choice of $T$, the number of samples required for accurate
estimation scales exponentially in $p$.
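To get a feel for the magnitudes involved, the short sketch below tabulates the worst-case information bound $(1 + \sigma^{-2})^{-p}$ from (25) together with its reciprocal, read here as a rough sample-size inflation factor. The function names and the chosen values of $\sigma$ and $p$ are illustrative assumptions.

```python
def worst_case_information(p, sigma):
    """Upper bound (1 + sigma^-2)^(-p) on the relative efficiency from (25)."""
    return (1.0 + sigma ** -2) ** (-p)

def sample_inflation(p, sigma):
    """Reciprocal of the bound: rough factor by which the X-sample must grow."""
    return (1.0 + sigma ** -2) ** p

for sigma in (0.5, 1.0, 2.0):
    row = [f"p={p}: x{sample_inflation(p, sigma):.1f}" for p in (2, 4, 6)]
    print(f"sigma = {sigma}: " + ", ".join(row))
```

For instance, with $\sigma = 1$ and $p = 4$ the factor is $2^4 = 16$, while a narrower carrier with $\sigma = 0.5$ already gives $5^4 = 625$.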

Qualitatively, this exponential scaling in $p$ provides a direct analogue to the
logarithmic convergence rates for kernel density estimation derived by Carroll
and Hall [1988] and Fan [1991]: If we work in an asymptotic regime where we
incur a bias on the order of $p^{-k}$ from learning a model with only $p$ degrees
of freedom, then our result suggests that, heuristically, the error rate cannot
decay faster than $\log_{1+\sigma^{-2}}(n)^{-k}$. The reason this analogy is only heuristic is
that our theory uses a local perturbation model, whereas that of Carroll and
Hall [1988] and Fan [1991] is global. The connection between the two theories
is however encouraging, in that it suggests that our local perturbation model
enables us to get a nuanced grasp of the statistics of density deconvolution
without changing the fundamental nature of the problem.

An important consequence of Theorem 5 is that, for all $p \ge 2$ and any $\sigma > 0$,
the family of distributions attaining the bound (25), namely

$g_\eta(\mu) = \frac{1}{\sigma} \varphi\left(\frac{\mu}{\sigma}\right) \exp\left[\eta \cdot h\left(\frac{\mu}{\sigma}\right) - \psi(\eta)\right],$ (27)

with $h = (h_1, \ldots, h_p)$ the normalized Hermite polynomials,
is equivalent to the family (3) after re-parametrizing $\eta$. In other words, we
find that the polynomial log-density model (3) is universally the most-favorable
family for density deconvolution near Gaussian carriers. Moreover, as a direct
consequence of Corollary 4, we see that this family can be used to closely
approximate any local perturbation to a Gaussian.

In Section 3, we will provide further justification for the estimator (27) by 
establishing minimax-optimality properties. Before doing so, however, we pause 
to study the implications of Theorem 3 in the case when $g_0$ is not Gaussian.


2.4 Local Density Deconvolution with Non-Gaussian Carriers

We can also use Theorem 3 to compute most favorable families for estimating
perturbations to non-Gaussian carriers $g_0$. In this case, we can no longer derive
closed-form expressions for them; however, we can still proceed numerically.

We begin by examining a “two towers” model, also considered by Efron 
[2014b]. We vary the scale of the carrier 

(a*) = ^ 1 ({1 < ImI < 2 }) and ^ ^ - 1^1 - ' 

The most favorable statistics for each choice are shown in the top row of Figure 1.
Interestingly, for the thin tower model, the towers appear to be too close
to each other for them to be adequately distinguishable after adding standard
Gaussian noise to the clean observations $\mu_i$; thus, the most favorable statistic
is effectively the mean $T(\mu) = \mu$. Conversely, with the thick tower model, the most
favorable statistic is an indicator function for the relative magnitude of each tower, and
the subsequent statistics measure tilts and spreads within each tower.

[Figure 1 panels: Two narrow towers (legend: $\rho$: 0.33, 0.15, 0.15, 0.03, 0.03); Two thick towers; Gaussian with Spike at 0; Gaussian with Spike at 2.]

Figure 1: Most favorable statistics for various non-Gaussian carriers
$g_0$. In each case, the noise was standard Gaussian $\varepsilon \sim \mathcal{N}(0, 1)$; the different
carriers $g_0$ are described in Section 2.4. The relative information coefficient $\rho$ is
computed univariately for each candidate statistic.

We also examined the case where $g_0$ is a half-and-half mixture of a centered
Gaussian with variance $\sigma^2 = 2$, and a $\delta$-spike at either $\mu = 0$ or $\mu = 2$. With
a spike at $\mu = 0$, the most favorable statistics are not that different from the
Hermite polynomials underlying (3). However, the spike at $\mu = 2$ changes the
picture considerably: now, the most stable statistic looks more like a hinge
$T(\mu) = (\mu - 2)_-$.

Of course, it is not immediately clear how these non-Gaussian most favorable
families are relevant to practical data analysis. A strength of the log-polynomial
model (3) is that it is most favorable near any Gaussian carrier $g_0$; meanwhile,
the statistics presented in Figure 1 depend on exact knowledge of $g_0$. However,
whether or not they are immediately useful, the results presented in Figure 1
can help us gain a better feeling for the flavor of density deconvolution, and for
how the shape of the carrier $g_0$ affects the nature of the problem.

3 Adaptive Local Minimaxity 

In the previous section, we showed that there exist most favorable statistics $T$
that capture most of the relevant information in the $X$-sample in terms of our
relative efficiency criterion. In this section, we show how to translate this result
into a more classical minimax framework. Beyond being of independent interest, 
the minimax results provided here may also serve as additional evidence that 
the relative efficiency framework proposed above is “natural” in the sense that 
it helps motivate good statistical procedures. 

In order to describe the asymptotics of Efron’s method, we need a certain
amount of formalism. To this end, we focus on the local perturbation model
described in Section 1.1, and show that it induces a sequence of statistical problems
that converge to a locally asymptotically normal experiment [Le Cam, 1960].
This connection then lets us draw from the extensive literature on estimation
in the Gaussian sequence model.

Here, we focus on the case where the carrier $g_0$ is Gaussian; this lets us cut
down on linear algebra and give closed-form bounds for the minimax shortfall
of Efron’s method. Following this choice, we use notation $\varphi_\sigma$ instead of $g_0$ for
the carrier; here, $\sigma^2$ denotes the variance of $\varphi_\sigma$. We note, however, that
exactly the same arguments can be used to derive the limiting risk of any
most-favorable family of the type considered in Theorem 3; the minimax performance
of Efron’s method then depends on the decay rate of the spectrum of $P_{g_0}$ (21).
We expand on this connection in Section 3.2.

Following our discussion in Section 1.1, we are interested in a sequence of
models

$g^{(n)}(\mu) = \varphi_\sigma(\mu) \exp\left[\frac{1}{\sqrt{n}} \tau(\mu) - \psi_\tau^{(n)}\right], \quad \mu \in [-M, M],$ (28)

where we eventually want to take $M \to \infty$. We assume that the tilting function
$\tau : \mathbb{R} \to \mathbb{R}$ is a generic smooth function of $\mu$ contained within an ellipsoid defined
with respect to its Hermite expansion


r 


{^l) G A 


(T 

K,C^ 


A 


<7 _ 

K,C — 


(t{-) 


i=i 






(29) 


Here, the parameters $\kappa$ and $C$ let us tune the shape of the ellipsoid, and the
inner product notation is short for

$\langle \tau(\cdot), h_j(\cdot / \sigma) \rangle_\sigma = \int \tau(\mu) \, h_j(\mu / \sigma) \, \varphi_\sigma(\mu) \, d\mu.$ (30)

Our goal is to get a good estimate $\hat{g}^{(n)}$ for $g^{(n)}$ from $n$ independent samples
$X_i$; we measure loss in terms of the re-scaled Kullback-Leibler divergence (8).
Throughout our analysis, we will assume that $\hat{g}^{(n)}$ is chosen so as to ensure
that this loss is finite.

Our key result is that, given an appropriate choice of $p$, Efron’s polynomial
log-density model (3) attains quasi-minimax performance simultaneously over
this class of problems, regardless of our choice of $\sigma^2$ and $\kappa$. We note that the
constant on the right-hand side of (31) can be quite small. For example, if
we take the carrier to be standard Gaussian, $\sigma^2 = 1$, and set the ellipse shape
parameter to $\kappa = 2$, we have $\alpha_{\sigma,\kappa} \approx 1.7$ and $\beta_{\sigma,\kappa} = 3$. Thus, the polynomial
log-density model (3) gets to within a factor 1.8 of the minimax risk.


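Theorem 6. Suppose that we are trying to solve the sequence of problems
defined by (28), (29), and (8). Let $\hat{g}_p^{(n)}$ denote the maximum likelihood estimator in
the polynomial log-density model (3) with $p$ chosen as in (35), and let $\hat{g}_*^{(n)}$ be
the minimax optimal estimator over the regularity class defined in (38). Then,
there is a constant $C_{\sigma,\kappa}$ such that, for $C > C_{\sigma,\kappa}$,

$\lim_{M \to \infty} \limsup_{n \to \infty} \frac{\sup_{\tau \in A^{\sigma}_{\kappa,C}} L_n(\hat{g}_p^{(n)})}{\sup_{\tau \in A^{\sigma}_{\kappa,C}} L_n(\hat{g}_*^{(n)})} \le \frac{\beta_{\sigma,\kappa}}{\alpha_{\sigma,\kappa}},$ (31)

where the constants $\alpha_{\sigma,\kappa}$ and $\beta_{\sigma,\kappa}$ are given in (33) and (37) respectively.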

To establish Theorem 6, we first need to derive the worst-case risk of the
minimax estimator. The key step in the proof is to show that the estimation
problem outlined above converges to an elliptically constrained Gaussian
sequence model. Now, as shown by Pinsker [1980], linear estimators get to
within a constant factor of the minimax risk for this class of problems, and so
the minimax estimation problem reduces to a linear estimation problem that
can be solved directly; see Johnstone [2011] for a review. The factor 4/5 in the
constant (33) is a bound obtained by Donoho et al. [1990] for the sub-optimality
constant in Pinsker’s result.

Lemma 7. Under the conditions of Theorem 6, for $C > C_{\sigma,\kappa}$,


lim liminfi inf sup 
M->oo n->oo I g(r,, tGA" p ' 


2 log(T-f7 ) 


(32) 


where Af is an L 2 regularity class defined in (38), and 

^ 4 rl {k - 1) / {rln - l) (rg/c^ - l) 

5 (r^ — 1) (r^K — 1) y r'^K^K—l) 

and $r_\sigma$ is short-hand for the constant

$r_\sigma^2 = (1 + \sigma^2) / \sigma^2.$

2 1og{r^) 
log(r< 3 .)+log{K) 


(33) 


(34) 


We can also derive the risk of the maximum likelihood estimator in the
polynomial log-density model (3) using the machinery developed for
proving Lemma 7. Comparing (32) and (36), we see that both estimators have
the same dependence on the signal scale $C$, and only differ by a constant function
of $\sigma^2$ and $\kappa$.

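Lemma 8. Suppose that the true density is as in the statement of Theorem
6, and we estimate it using the exponential family model (3) with

$$p = \max\left\{ 2,\ \left\lceil \frac{\log(C/\sigma)}{\log(r_\sigma \kappa)} \right\rceil - 1 \right\}.$$ (35)

Then, for $C > C_{\sigma,\kappa}$,

$$\lim_{M\to\infty} \limsup_{n\to\infty} \left\{ \sup_{\tau \in \mathcal{A}_{\kappa,C}} L_n(\hat g^{(p)}) \right\} \le \beta_{\sigma,\kappa}\, C^{\frac{2\log(r_\sigma)}{\log(r_\sigma)+\log(\kappa)}},$$ (36)

with

$$\beta_{\sigma,\kappa} = (1 + r_\sigma^2)\, \sigma^{\frac{2\log(\kappa)}{\log(r_\sigma)+\log(\kappa)}}.$$ (37)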


3.1 Density Deconvolution and the Gaussian Sequence Model 

Here, we outline the proof of Lemma 7 by showing how our density estimation 
problem converges to a Gaussian sequence model. Throughout our analysis, we 
assume that our density estimate is in the class defined by the following 
relation 



for some large constant > C^, along with the constraint that all integrals 
defined above be finite and well-defined.

Thanks to our regularity assumption, we can use notation from (30) to define
“Hermite coefficients” 

"> = ("<■>■">(;))„■ (=>*>' 

Meanwhile, given our sample Xi, ..., we can also define empirical Hermite 
coefficients as 


y{n) ^ J_ / l+a'^ 


j72 





(40) 


Because converges to the carrier qq as n gets large, we can show that for 
any finite set of indices {ji}, the 2’-"^ are asymptotically jointly normal with 


lim E 




= Iji +om(1), 


lim Cov 

n—XX) 


Z 


(n) 


^(n) 


= +om(1) 


(41) 

(42) 


where we again used the notation $r_\sigma^2 = (1 + \sigma^2)/\sigma^2$.

These observations suggest that deriving a good estimator for the density
is related to finding the mean of the Gaussian sequence with covariance
structure (41). The following lemma makes this connection explicit, thus reducing
our setting to a well-understood problem, namely estimating a Gaussian
sequence model with elliptical constraints under squared-error loss.


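Lemma 9. For large $M$, the limiting minimax risk for the sequence of density
deconvolution problems defined in Lemma 1 converges to the minimax risk for
estimating the mean of a Gaussian sequence $Z_j$ for $j = 1, 2, \ldots$ with

$$\mathbb{E}[Z_j] = \gamma_j, \qquad \mathrm{Cov}[Z_j, Z_k] = \delta_{\{j=k\}}\, r_\sigma^{2j}$$ (43)

under the loss

$$L(\hat\gamma(Z)) = \sum_{j=1}^{\infty} \left(\hat\gamma_j(Z) - \gamma_j\right)^2,$$ (44)

and the constraint $\gamma \in \mathcal{E}_\kappa(C) := \{\gamma : \sum_{j=1}^{\infty} \kappa^{2j} \gamma_j^2 \le C^2\}$. In other words,

$$\lim_{M\to\infty} \lim_{n\to\infty} \left\{ \inf_{\hat g \in \mathcal{A}_M^n} \sup_{\tau \in \mathcal{A}_{\kappa,C}} L_n(\hat g) \right\} = \inf_{\hat\gamma} \sup_{\gamma \in \mathcal{E}_\kappa(C)} \mathbb{E}\left[L(\hat\gamma(Z))\right].$$ (45)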


Now, this class of Gaussian sequence models can be well estimated using
linear rules: Pinsker [1980] established that the risk of the best linear rule is
within a constant factor of the minimax risk; furthermore, Donoho et al. [1990]
showed that this constant is less than 5/4. Thus, it suffices to find the risk of
the best linear rule of the form $\hat\gamma_j = c_j Z_j$ for some constants $c_j$. Now, in the
Gaussian limit, the worst-case risk of the minimax linear rule $\hat\gamma_L$ has a simple
form [e.g., Johnstone, 2011, Chapter 5.1]:

$$R_L := \sup_{\gamma \in \mathcal{E}_\kappa(C)} \mathbb{E}\left[L(\hat\gamma_L)\right] = \sum_{j=1}^{\infty} r_\sigma^{2j} \left(1 - \kappa^j/\mu_C\right)_+,$$ (46)

where $\mu_C$ is implicitly defined by the relation

$$\sum_{j=1}^{\infty} r_\sigma^{2j} \kappa^j \left(\mu_C - \kappa^j\right)_+ = C^2.$$ (47)

Since we know that the minimax risk $R^*$ is bounded from below by $4R_L/5$,
the proof of Lemma 7 reduces to algebra; the remaining steps are carried out in
the appendix.
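
As a numerical illustration of the linear minimax rule, the following Python sketch solves the implicit relation (47) for $\mu_C$ by bisection and then evaluates the risk formula (46), reading them as $\sum_j r_\sigma^{2j}\kappa^j(\mu_C - \kappa^j)_+ = C^2$ and $R_L = \sum_j r_\sigma^{2j}(1 - \kappa^j/\mu_C)_+$. The function names, the tolerance, and the example values ($\sigma^2 = 1$, $\kappa = 2$, $C = 10$) are illustrative assumptions rather than part of the original analysis.

```python
def constraint_sum(mu, r2, kappa):
    """Left-hand side of (47): sum_j r2^j * kappa^j * (mu - kappa^j)_+."""
    total, j = 0.0, 1
    while kappa ** j < mu:  # terms with kappa^j >= mu vanish
        total += (r2 * kappa) ** j * (mu - kappa ** j)
        j += 1
    return total


def solve_mu(C, r2, kappa, tol=1e-12):
    """Find mu_C with constraint_sum(mu_C) = C^2 by bisection.

    The left-hand side is non-decreasing in mu, so bisection applies.
    """
    lo, hi = kappa, 2.0 * kappa
    while constraint_sum(hi, r2, kappa) < C ** 2:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if constraint_sum(mid, r2, kappa) < C ** 2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def linear_minimax_risk(C, sigma2, kappa):
    """Return (mu_C, R_L) with r2 = (1 + sigma2) / sigma2 as in (34)."""
    r2 = (1.0 + sigma2) / sigma2
    mu = solve_mu(C, r2, kappa)
    risk, j = 0.0, 1
    while kappa ** j < mu:
        risk += r2 ** j * (1.0 - kappa ** j / mu)
        j += 1
    return mu, risk


mu_C, R_L = linear_minimax_risk(C=10.0, sigma2=1.0, kappa=2.0)
print(mu_C, R_L)
```

By the 5/4 bound of Donoho et al. [1990] quoted above, `0.8 * R_L` is then a lower bound on the minimax risk in the Gaussian sequence model.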


3.2 Extension to General Carriers 

In an effort to reduce notational burden, the above argument focused on the
Gaussian carrier case, i.e., $g_0 = \mathcal{N}(0, \sigma^2)$ for some $\sigma > 0$. In this section, we briefly
outline the technical ideas needed to prove analogous local minimaxity results
in the neighborhood of general carriers $g_0$. The proof of Theorem 6 relied on
showing that large-sample statistical inference in the local density deconvolution
model reduces to a study of the asymptotically normal “empirical Hermite
coefficients” defined in (40). A similar result also holds in the case of general
$g_0$; however, the statistics now take on a more general form.

Let $T_j$ be the statistics comprising the most-favorable family around
$g_0$, and denote the relative efficiency of the $j$-dimensional most-favorable family
by $\rho_j$. Then, inference can be asymptotically framed in terms of the moments


1 "■ 

U{x) 

^ i—1 


{(j) * (Tj go)) (x) 
fo{x) 


(48) 


The following lemma shows that the $Z_j^{(n)}$ in fact have the desired limiting
distribution. An analogue to Lemma 9 then follows directly.

Lemma 10. Suppose that the conditions of Theorem 3 hold, and that the perturbation
function $\tau$ satisfies $\int_\Omega \tau^2(\mu)\, g_0(\mu)\, d\mu < \infty$. Then, the $Z_j^{(n)}$ defined
in (48) are asymptotically normal over any finite set of indices and, for any
$j, j' \in \mathbb{N}$,

$$\lim_{n\to\infty} \mathbb{E}\left[Z_j^{(n)}\right] = \int_\Omega T_j(\mu)\, \tau(\mu)\, g_0(\mu)\, d\mu,$$ (49)

$$\lim_{n\to\infty} \mathrm{Cov}\left[Z_j^{(n)}, Z_{j'}^{(n)}\right] = \delta_{\{j=j'\}}\, \rho_j^{-1}.$$ (50)

Given this result, we can use Pinsker’s theorem to compute tight bounds for
the minimax risk of density deconvolution near $g_0$ as a function of the
most-favorable relative efficiency coefficients $\rho_j$ (i.e., the spectrum of $P_{g_0}$ defined in
(21)). We note that Lemma 10, and in fact the whole machinery of using a
Gaussian sequence model to understand the behavior of maximum likelihood
estimation in the most-favorable family, only depends on the square-integrability
condition $\int_\Omega \tau^2(\mu)\, g_0(\mu)\, d\mu < \infty$ and on the compactness of $P_{g_0}$. However, in
order for the resulting minimaxity properties to be any good, we need for the
spectrum of $P_{g_0}$ to decay fast.

4 Two Optimality Theories for Density Deconvolution

The main contribution of our paper is a local optimality theory for density
deconvolution that helps us understand and justify Efron’s method, i.e., maximum
likelihood estimation in a model of the form (2). This line of work is in contrast
to the classical optimality theory based on kernel estimators. Notable
contributions include the pioneering work of Carroll and Hall [1988] and Stefanski and
Carroll [1990], the analysis of Fan [1991] that elucidates the connection between
the decay rate of the Gaussian characteristic function and the difficulty of
density deconvolution, the strong quasi-minimaxity results of Efromovich [1997], as
well as several other papers cited in the introduction.

A key difference between these theories is that kernel-based methods are
quasi-optimal in terms of an integrated squared error (ISE) loss criterion whereas
our theory is framed in terms of the Kullback-Leibler (KL) loss, defined
respectively as



(51) 


(52) 


Moreover, our theory aims to detect local perturbations of $g_0$, whereas the ISE
criterion is usually applied globally.

From a scientific point of view, the value of an optimality theory depends
on the relevance of the induced estimators to answering real-world questions.
Kernel-based methods have a good track record for solving some classic
problems, such as the in-season baseball prediction problem introduced by Efron and
Morris [1975]; see, e.g., Brown [2008]. In fact, these methods have been shown
to approach the Bayes risk for estimating the posterior mean $\mathbb{E}[\mu \mid X = x]$
[Brown and Greenshtein, 2009].

In this section, however, we present examples of natural scientific questions
for which Efron’s method provides substantially better answers than kernel
methods. We hope that these examples will convince the reader that a KL-based
optimality theory for density deconvolution is, if nothing else, worthy of
further study. For our experiments, we used the R-package decon [Wang and
Wang, 2011] for kernel-based estimation.

4.1 Detecting the Fraction of Weakly Associated Genes in 
a Microarray Study 

Our first example simulates a classic biological application: gene expression
profiling using microarrays. At a high level, the goal is to estimate the difference
in the expression levels of different genes for two distinct cell populations (e.g.,
breast cancer cells vs. healthy cells). After some pre-processing, each gene can
be associated with a test statistic that has a standard normal distribution under
the null hypothesis that the gene’s expression levels do not differ across the two
groups; the resulting statistical problem is described in detail by Efron et al.
[2001] and Tusher et al. [2001].

Here, we follow the structural model analysis of Efron [2004] who showed
that, to reasonable approximation, we can model the gene-wise test statistics
$X_i$ as $X_i \sim \mathcal{N}(\mu_i, 1)$, where $\mu_i$ describes the true association between the $i$-th
gene and the condition of interest. The $i$-th “exact” null hypothesis is that
$\mu_i = 0$. However, as argued in, e.g., Chapter 6 of Efron [2010], this exact null
may not always be scientifically relevant. It seems likely that most genes have
a small but non-zero true association $\mu_i$; the goal of the statistician is then not
to identify the non-zero $\mu_i$ but rather to identify the large associations $\mu_i$.

Motivated by this setup, suppose that, given the power afforded by our
sample size, we decide that having $\mu_i \in [-2, 2]$ qualifies as a “small” association,
and we want to estimate the fraction of genes whose association is in this
range. In our framework, assuming that the gene associations $\mu_i$ are drawn from
a distribution with density $g(\cdot)$, we want to estimate $\int_{-2}^{2} g(\mu)\, d\mu$. For Figure 2,
we ran 1,000 replicates of such a simulation with $n$ = 5,000 genes each, where
the true density $g$ was given by

$$g(\mu) = 0.95 \cdot \tfrac{1}{4} \left(2 - |\mu|\right)_+ + 0.05 \cdot \tfrac{1}{20}\, 1\left(\{|\mu| < 10\}\right).$$

We then generated $X$-statistics from the structural model $X \sim \mathcal{N}(\mu, 1)$, and
generated estimates $\hat g$ both using the kernel method decon and maximum
likelihood in the family (3) with $p = 4$.

In this simulation, the correct answer was $\int_{-2}^{2} g(\mu)\, d\mu = 0.96$. Efron’s
method (3) does substantially better than the kernel method, with the exception
of a few cases where it vastly underestimates the bulk of $g(\cdot)$. In practice,
Efron [2014a] recommends the use of regularization to stabilize the estimators
and mitigate such finite sample effects. Here, however, we did not use any
regularization in order to keep our experiments as simple as possible.
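
The data-generating step of this simulation can be sketched in a few lines of Python. The sketch below only draws the associations and the test statistics and checks the target fraction; it does not reproduce either estimator. The helper names and the seed are illustrative assumptions, and the mixture is read as a triangular density on $[-2, 2]$ with weight 0.95 plus a uniform density on $[-10, 10]$ with weight 0.05.

```python
import random


def sample_mu(rng):
    """Draw mu from 0.95 * Triangular(-2, 2, mode 0) + 0.05 * Uniform(-10, 10)."""
    if rng.random() < 0.95:
        return rng.triangular(-2.0, 2.0, 0.0)  # arguments: low, high, mode
    return rng.uniform(-10.0, 10.0)


def simulate(n, seed=0):
    """Return associations mu_i and statistics x_i ~ N(mu_i, 1)."""
    rng = random.Random(seed)
    mus = [sample_mu(rng) for _ in range(n)]
    xs = [mu + rng.gauss(0.0, 1.0) for mu in mus]
    return mus, xs


# Exact target: all of the triangular part plus 4/20 of the uniform part.
target = 0.95 + 0.05 * 4.0 / 20.0

mus, xs = simulate(n=5000)
frac_small_mu = sum(abs(m) <= 2.0 for m in mus) / len(mus)
frac_small_x = sum(abs(x) <= 2.0 for x in xs) / len(xs)
print(target, frac_small_mu, frac_small_x)
```

The fraction of observed statistics with $|X_i| \le 2$ falls well short of the fraction of true associations in that range, which is why the noise has to be deconvolved rather than ignored.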

4.2 Identifying Safe Neighborhoods 

Our second example is based on a real dataset: the communities and crime
unnormalized dataset from the UCI Machine Learning Repository [Lichman, 2013],
which tallies crime data from multiple US communities in 1995; we restricted
our analysis to the n = 1,192 communities with population greater than 20,000
and non-missing crime data.

Figure 2: Gene expression profiling simulation. The goal is to estimate
$\int_{-2}^{2} g(\mu)\, d\mu$; here, the correct answer (0.96) is indicated by a thick vertical line.
Over 1,000 simulation replicates, Efron’s method usually performs substantially
better than the kernel-based alternative.

Our goal was to understand the number of non-violent crimes per 100 people.
In our sample of communities, the average non-violent crime rate was around
5 per 100 inhabitants, while 6% of communities achieved a non-violent crime
rate below 2%. We define a community with a rate of less than 2% as safe. The
dataset was large enough that we could accurately estimate per-community
crime rates; to test our deconvolution methods, we made the problem harder
by down-sampling the data and then seeking to recover the correct answer.

Figure 3: Crime prediction example. The goal is to predict the probability
that a community is safe ($p_i < 0.02$) given a crime-rate estimate $\hat p_i$ obtained
by interviewing B = 500 randomly selected people. Efron’s method provides a
closer approximation of the oracle rule than the kernel method. (Horizontal
axis: in-sample crime rate per 100 residents.)

To down-sample the data in a mathematically principled way, we made the
practically somewhat implausible assumption that each person was victim of
at most 1 crime in 1995. Then, we can imagine assembling a crime dataset by
interviewing B = 500 randomly selected people per community, and counting
the number $N_i$ of interviewees in the $i$-th community that have been victims of
a non-violent crime. We can easily simulate the outcome of such an interview
using the available data by hypergeometric sampling:

$$N_i \sim \text{Hyper}(B,\ \text{Crimes in community } i,\ \text{Population } i).$$

Our statistical task is to estimate which communities are safe (i.e., have a rate
of less than 2%) based on statistics $N_i$ collected in different communities. Let
$p_i$ denote the true crime rate in the $i$-th community, and let $\hat p_i = N_i/B$. By a


standard variance stabilizing argument, we can verify that 

Thus, to estimate $P[p_i < 0.02 \mid \hat p_i]$, we can first apply a square-root transform to
the data, then use any method to estimate the density $g$ of the transformed rates, and finally
apply Bayes’ rule to get our desired quantity.
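
The down-sampling and transformation steps can be sketched as follows. The community below is hypothetical (50,000 residents, 2,500 victims), and the scaling $2\sqrt{B \hat p_i}$ is an assumption based on the usual delta-method approximation for the square-root transform, under which the transformed statistic has variance close to $1 - p_i$, i.e., roughly one when $p_i$ is small.

```python
import math
import random


def hypergeometric_draw(rng, B, crimes, population):
    """Number of victims among B people interviewed without replacement."""
    victims, remaining_crimes, remaining_pop = 0, crimes, population
    for _ in range(B):
        if rng.random() < remaining_crimes / remaining_pop:
            victims += 1
            remaining_crimes -= 1
        remaining_pop -= 1
    return victims


def stabilized(n_i, B):
    """Square-root transform 2 * sqrt(B * p_hat), with p_hat = n_i / B."""
    return 2.0 * math.sqrt(B * (n_i / B))


rng = random.Random(0)
B, population, crimes = 500, 50_000, 2_500  # hypothetical community, true rate 5%
draws = [stabilized(hypergeometric_draw(rng, B, crimes, population), B)
         for _ in range(2000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(mean, 2.0 * math.sqrt(B * crimes / population), var)
```

On this scale the problem approximately matches the unit-variance Gaussian noise model (1), so the deconvolution methods can be applied to the transformed statistics.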

Figure 3 shows results for the kernel method and Efron’s method; both density
deconvolution methods were implemented exactly the same way as for the
gene expression example. The oracle rule was produced by fitting a spline logistic
regression of $1(\{p_i < 0.02\})$ against $\hat p_i$. Efron’s method again substantially
outperforms the alternative. For example, if we ask about the probability that a
community is safe given that $\hat p_i = 0.02$, Efron’s method tells us that this
probability is 18.6% whereas the kernel method tells us 39.6%. The correct answer,
produced by the oracle fit, is 21.5%.
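
The final Bayes' rule step, which turns an estimated density into a posterior probability, can be sketched on a grid. The prior below is a hypothetical flat placeholder standing in for a fitted density estimate, and the function name is an assumption; the likelihood is the unit-variance Gaussian noise model.

```python
import math


def posterior_prob(x, grid, weights, in_set):
    """P(mu in S | X = x) for X | mu ~ N(mu, 1) and a discretized prior."""
    lik = [math.exp(-0.5 * (x - m) ** 2) for m in grid]
    num = sum(l * w for l, w, m in zip(lik, weights, grid) if in_set(m))
    den = sum(l * w for l, w in zip(lik, weights))
    return num / den


# Hypothetical prior: equal weights on an evenly spaced grid over [-4, 4].
grid = [(k - 400) / 100.0 for k in range(801)]
weights = [1.0 / len(grid)] * len(grid)

p_at_zero = posterior_prob(0.0, grid, weights, lambda m: m < 0.0)
p_at_one = posterior_prob(1.0, grid, weights, lambda m: m < 0.0)
print(p_at_zero, p_at_one)
```

For the crime example, `weights` would be replaced by the estimated density on the transformed scale and `in_set` by the image of the event $p_i < 0.02$ under the same transform.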
Figure 4: A comparison of Efron’s method with a log-polynomial model (3) with
p = 4 with kernel-based density deconvolution as implemented in decon. We
tried four different bandwidth choices for the kernel method.


4.3 Fitting the Mode vs. Fitting the Support

To provide some deeper insight into the differences between the two density 
deconvolution methods under consideration, we end our experimental section 
with an in-depth look at a simple density estimation problem with a large sample 
size: we drew n = 100,000 observations from the generative model (1) with a 
density g of the form 


g{g) oc exp 




for — 2 < /r < 2, and g{g) = 0 else. 


We note that this density is not contained in the span of either (3) or the basis 
functions implicitly used by decon. Results are shown in Figure 4. 

To give the kernel density method a good chance of performing well, we tried
four different bandwidth-selection algorithms provided by decon: (1) a closed
form approximation to the bootstrap recommended by Delaigle and Gijbels [2004]
(this method is the default, and was used for the other experiments), (2) a rule of
thumb by Fan [1991], (3) an approximate ISE minimizer by Stefanski and Carroll
[1990], and finally (4) a hand-selected bandwidth to demonstrate the effect of
mild under-smoothing. For Efron’s method we, as usual, set $p = 4$ in (3).

We notice that all methods do a roughly equivalent job of estimating the
target density $g$ near its mode. However, Efron’s method is much more accurate
in estimating the tails and support of $g$. This observation is not entirely
surprising in light of the loss functions that were used to motivate each method.
The ISE loss (51) only requires $\hat g$ to be reasonably close to $g$ over the domain
of $\mu$, but does not impose particularly harsh penalties for oscillatory behavior
in the tails. In contrast, the KL loss places more attention on getting the
tails and support of the distribution right. It thus appears that kernel methods
built on ISE-optimality theory are unreliable for answering scientific questions
that depend on understanding the tail-behavior of $g$, whereas methods based
on KL-optimality may perform better.
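
To make the contrast between the two losses concrete, here is a small Python sketch on a discretized toy problem. It assumes the standard forms $\text{ISE}(\hat g, g) = \int (\hat g - g)^2$ and $D_{KL}(g \,\|\, \hat g) = \int g \log(g/\hat g)$; the uniform target and the two hypothetical estimates are chosen for illustration only and do not come from the experiments above.

```python
import math

h = 0.01
grid = [-2.0 + h * (k + 0.5) for k in range(400)]  # cell midpoints on [-2, 2]
g = [0.25] * len(grid)                              # true density: Uniform(-2, 2)


def normalize(d):
    """Rescale a non-negative vector so that it integrates to one on the grid."""
    total = sum(d) * h
    return [v / total for v in d]


def ise(est, truth):
    """Integrated squared error between two densities on the grid."""
    return sum((e - t) ** 2 for e, t in zip(est, truth)) * h


def kl(truth, est):
    """D_KL(truth || est); assumes est > 0 wherever truth > 0."""
    return sum(t * math.log(t / e) for t, e in zip(truth, est) if t > 0) * h


# Estimate A: mass tilted from the right half of the support to the left half.
eps = 0.229
est_a = normalize([0.25 * (1 + eps) if m < 0 else 0.25 * (1 - eps) for m in grid])
# Estimate B: right shape in the bulk, almost no mass within 0.1 of the boundary.
est_b = normalize([0.25 * 1e-3 if abs(m) > 1.9 else 0.25 for m in grid])

print(ise(est_a, g), kl(g, est_a))
print(ise(est_b, g), kl(g, est_b))
```

The two estimates are tuned to have nearly the same ISE, yet the one that misses the edge of the support pays a much larger KL penalty, in line with the qualitative argument above.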


5 Discussion 

The work of Efron [2014a, b] presents a surprising challenge to the theory of
density deconvolution. From, e.g., Efromovich [1997], it may appear that the
problem of density deconvolution with Gaussian errors is completely solved,
and that kernel density estimators are optimal for the task. And yet, Efron
[2014b] found that his maximum likelihood method vastly out-performed methods
related to kernel density estimators for several realistic problems (Efron
calls this latter approach “$f$-estimation”).

The goal of this paper was to understand why Efron’s method could work
better than kernel density estimators despite the theoretical guarantees available
for the latter. To do so, we introduced a perturbation model inspired by
Le Cam’s local asymptotic normality theory, and showed that Efron’s method
is quasi-optimal in this setup for deviance loss. Under the assumption that deviance
loss comes closer to describing the “real loss function” of a practitioner
than the integrated squared error loss used to establish optimality properties of
kernel density estimators, our results can be seen as moving towards an explanation
for the empirical success of Efron’s method.

More broadly, our results highlight a surprisingly strong connection between
non-parametric density deconvolution and classical likelihood theory. We found
that, at least locally, there are some directions along which we can accurately
estimate a signal and maximum-likelihood estimation is quasi-optimal for this
task; meanwhile, there are other directions in which estimation is hopeless and
the minimax strategy is to ignore them. From a practical perspective, this connection
appears rather reassuring, as low-dimensional large-sample maximum-likelihood
estimation has proven to be one of the most consistently successful
ideas in applied statistics.

A Proofs

Proof of Lemma 1 

It suffices to prove the second conclusion, as the alternative expression for the 
relative efficiency coefficient given in (14) can directly be verified to be trans¬ 
formation invariant. We already know that (rj) = Var^ [^^(m)]) so all we 
really need to do is to check that 


iv) = Var^ [E^ [t{fi) | X] ] . 


Although we could also get to this answer by drawing analogies to the work of
Louis [1982], we will take a direct approach here. In the univariate family with
statistic $t$, the expected score function is

$$\frac{d}{d\eta} \log f_\eta(x) = \frac{d}{d\eta} \log \int K(\mu, x)\, g_\eta(\mu)\, d\mu
= \frac{\int K(\mu, x)\, g'_\eta(\mu)\, d\mu}{\int K(\mu, x)\, g_\eta(\mu)\, d\mu}
= \frac{\int K(\mu, x)\, g_\eta(\mu) \left(t(\mu) - \mathbb{E}_\eta[t(\mu)]\right) d\mu}{\int K(\mu, x)\, g_\eta(\mu)\, d\mu}
= \mathbb{E}_\eta\left[t(\mu) \mid X = x\right] - \mathbb{E}_\eta\left[t(\mu)\right].$$

Thus, we conclude that 



Proof of Lemma 2 


By Bayes’ rule, we know that the conditional density of /i given x is 



from this, it directly follows that 


Var [E [t (/r) I a]] = / t"^ (p) (x, g) g^ (p) f ^{x)dxdg,. 
Similarly, we can check that


Var [t (^)] = f g (m) dfj.. 

Jq 

For a statistic t (/r) of the form t = a-T, we can write {TT^)a-, with this 

notation, we recover the first result of Lemma 2. The explicit solution given in 
(19) and (20) for a* in terms of the spectrum of Qj MtQt is standard. 


Proof of Theorem 3 

Because the relative efficiency coefficient is transformation invariant, we can 
without loss of generality pick a statistic T for which 


/ T{fi)go{fJ,)dn = 0, (53) 

Jn 

[ T {g)T^{g)gQ{g)dg, X p ■ (54) 

Jn 

Given such a choice, the relative efficiency formula from Lemma 2 simplifies to 


Po (T) = Xmin(^ J^T {x, g.)gl{fi) f ^{x)dxdn'^, (55) 

= Amin {y/go (m) T (fj.)) Pg„ (/T, g) (^ {fi) T)^ dfi^ , 

where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of a linear operator $A$. Minimizing
the objective (55) subject to the constraint (54) is a standard spectral
analysis problem. By the Courant-Fischer-Weyl maximin theorem as stated in,
e.g., Shawe-Taylor et al. [2005], we find that because $P_{g_0}$ is both self-adjoint
and compact (and thus also completely continuous),


max |po (S) : S e L 2 {Vtf , J S (g.) S^{g) go (g) dg = /pxpj = Ap, 

where $\lambda_p$ is the $p$-th eigenvalue of $P_{g_0}$; moreover, this maximum is attained by
setting $S_j(\mu) = \zeta_j(\mu)/\sqrt{g_0(\mu)}$ for $j = 1, \ldots, p$, where the $\zeta_j$ are the leading
eigenvectors of $P_{g_0}$. We note that $\zeta_j(\mu) \in L_2(\Omega)$ because $P_{g_0}$ is compact; $S_j$ is
then also in $L_2(\Omega)$ because $g_0$ is bounded away from 0 on $\Omega$. Finally, because $g_0$
and $K$ are continuous, $P_{g_0} S_j$ is continuous and so $S_j$ must also be continuous.

Now, we still need to deal with the constraint (53). Thankfully, we can verify
that all the eigenvalues of $P_{g_0}$ are bounded by 1, and that $\zeta_1(\mu) := \sqrt{g_0(\mu)}$ is
an eigenfunction of $P_{g_0}$ with eigenvalue 1. By orthogonality of the spectrum of
$P_{g_0}$, we then see that

$$\int_\Omega \frac{\zeta_j(\mu)}{\sqrt{g_0(\mu)}}\, g_0(\mu)\, d\mu = \int_\Omega \zeta_j(\mu)\, \zeta_1(\mu)\, d\mu = 0$$

for all $j > 1$. Thus, the minimizer of (55) with both constraints (53) and (54)
is given by $T_j(\mu) = \zeta_{j+1}(\mu)/\sqrt{g_0(\mu)}$ for $j = 1, \ldots, p$, and the objective value
(55) is the $(p+1)$-st eigenvalue of $P_{g_0}$. Moreover, if the spectrum of $P_{g_0}$ does
not have repeated eigenvalues, the span of $T_1, \ldots, T_p$ maximizing our objective
is unique. We note that, because $g_0$ and $S_j$ are continuous and $\Omega$ is compact,
$\mathbb{E}[g_0(\mu) \exp(\eta \cdot T(\mu))]$ is finite for $\eta$ in a neighborhood of 0 and so our
most-stable family is in fact well-defined.


Proof of Corollary 4 


Because Pg„ is compact, we know by the eigenfunctions form a complete 

orthonormal basis for L 2 (n); thus, we see that the statistics also form 

a complete orthonormal basis for L 2 (n) with inner product weighted by go and 
can write^ 


i=i 



{g)go{g) = Y^ir 

i=i 


Given this notation, we set Ij (m)- Our goal is to show that, 

for any integer J > p, our choice of rO-*) satisfies the conclusion of Corol¬ 
lary 4 under the assumption that = 0 for all j > J. Because this bound 
holds uniformly in J, we conclude that it also holds in the non-parametric case 

/n^MM)ffo(Ai) < J 

Following the above discussion, we now assume that r (/r) = (/r)) 

and write 7 for the parameter vector inducing r. The target loss is then 


dkl 4ri .)) 

= / In 


x) log 


/r/VS ( 2 ^) 


dx 


JQ 

1 

2n 


fT(p^ *)/\/n {^) ^ 

*) _ y 2 jQg (/^/^ (x)) *) - 7 ^ (x) dx 

+ o(n“^) , 


where the derivative is taken with respect to the parameters 7 . We note 
that the first-order term depending on V log(-) integrates out to 0. Now, taking 
limits, we find that 


lim uDkl (/^”\ 

n—>-oo \ ' / 

= ^ (7^^’ - 7) log (/o (/i)) fo (x) dx *'> - 7) 

^Without loss of generality, 70 = 0 since this term is absorbed by the normalization. 
where denotes the Fisher information for estimating 7 from X-samples

in the J-dimensional most-favorable family. But now, Fj (qq) is scaled such that 
2 ^./(go) _ Thus, the limiting approximation error is equal to 


lim uDkl 4 r 2 .)) 

n—>-oo \ ^ / 


v(P’ ' 


((- 


P 7 


{p, *) 


which, by Theorem 3, can be bounded above by \ C'^Ap +2 {Pgo)- 


Proof of Theorem 5 

Given our assumptions, we know that the carrier g and the marginal density of 
the observations / are given by 



where both densities loop around if needed to accommodate the bounded do¬ 
main. For our proof, we begin by verifying that the 


(m) = 



are eigenfunctions of $P_g$ in the limit $M = \infty$. Then, for large but finite $M$, the
$\nu_j$ are nearly eigenfunctions of $P_g$; meanwhile, the conditions of Theorem 3 are
satisfied and so our desired conclusion follows.

Now, to verify that the $\nu_j$ are eigenfunctions with $M = \infty$, we first note that
$P_g$ is a compact kernel and so it does in fact admit a spectral decomposition,
and second that


J (/ri) Pg {fii, ^2) I'k (M2) dgidg .2 



Now, focusing on the inner terms, we can check that 



where $\varphi^{(j)}$ denotes the $j$-th derivative of the standard Gaussian density with
respect to its argument. It is well known that the convolution of two Gaussian
densities is again a Gaussian density whose variance is the sum
of the original variances, and that convolution commutes with differentiation.
In terms of the change of variables s = xja^ we see that


- 1 1 i ^ « 

y/1 + 0-2 y/j\, \9sy y y'l _|_ 0-2 

1 I ( a y (,-) ^ ^ 

Vl + cr^ \ Vl + cr^ / ^ VVl + cr^/ 

Plugging this expression into our previous formula, we find that 


JVj Pg (/Ti, /i2) l/fe (/^2) dnidfX2 

P"- f (tB?) (7fe») 

v^^»>(7lb) 

Hk 


\/l + cr^ Wl + 




\/l + 0-2 


\/l + 


VT+dr^ 


v/r 


: ip 


VT 


j + k 


Vl + 1 


dec 


idj (a;) idfc (x) p (x) dx 


Vl 


j + k 


S {{j = k}) 


because the Hermite polynomials as defined in (26) are orthonormal with respect
to the standard Gaussian distribution. By Theorem 3, we thus conclude that
the most favorable family is given by the first $p$ Hermite polynomials. Moreover,
again by Theorem 3, the relative efficiency coefficient of this family corresponds
to the $(p+1)$-st eigenvalue of $P_g$, i.e., $(\sigma^2/(1 + \sigma^2))^{p}$. Because $T_j$ was a most
favorable family for density deconvolution, any other family of the form (2) will
have worse relative efficiency.
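
The two facts used in this proof can be checked by simulation. The sketch below assumes that the Hermite polynomials in (26) are the probabilists' Hermite polynomials normalized by $\sqrt{j!}$, and uses Mehler's identity: for standard normal $U$ and $V$ with correlation $\rho$, $\mathbb{E}[h_j(U) h_j(V)] = \rho^j$. With $U = \mu/\sigma$ and $V = X/\sqrt{1 + \sigma^2}$ we have $\rho^2 = \sigma^2/(1 + \sigma^2)$, so the squared cross-moment estimates the $j$-th power of that ratio. Function names, sample size, and seed are illustrative choices.

```python
import math
import random


def hermite_normalized(j, x):
    """Probabilists' Hermite polynomial He_j(x) / sqrt(j!), via the recurrence."""
    prev, cur = 0.0, 1.0  # He_{-1} = 0, He_0 = 1
    for n in range(j):
        prev, cur = cur, x * cur - n * prev  # He_{n+1} = x He_n - n He_{n-1}
    return cur / math.sqrt(math.factorial(j))


def check(sigma2=1.0, n=200_000, seed=0):
    """Monte Carlo Gram matrix of h_1, h_2 and cross-moments E[h_j(U) h_j(V)]."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    scale = math.sqrt(1.0 + sigma2)
    gram = {(1, 1): 0.0, (1, 2): 0.0, (2, 2): 0.0}
    cross = {1: 0.0, 2: 0.0}
    for _ in range(n):
        mu = rng.gauss(0.0, sigma)    # mu ~ N(0, sigma^2)
        x = mu + rng.gauss(0.0, 1.0)  # X | mu ~ N(mu, 1)
        h_mu = {j: hermite_normalized(j, mu / sigma) for j in (1, 2)}
        h_x = {j: hermite_normalized(j, x / scale) for j in (1, 2)}
        for (j, k) in gram:
            gram[(j, k)] += h_mu[j] * h_mu[k] / n
        for j in cross:
            cross[j] += h_mu[j] * h_x[j] / n
    return gram, cross


gram, cross = check()
rho2 = 1.0 / (1.0 + 1.0)  # sigma^2 / (1 + sigma^2) at sigma^2 = 1
print(gram, {j: (cross[j] ** 2, rho2 ** j) for j in cross})
```

Up to Monte Carlo error, the Gram matrix is the identity and the squared cross-moments decay geometrically in $j$, matching the eigenvalue sequence described above.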


Proof of Lemma 7 

Continuing our argument from Section 3.1, it remains to derive a
lower bound for the minimax risk $R_L$ among linear estimators in the Gaussian
sequence model. We begin by noting that

$$\kappa^{J_C} \le \mu_C < \kappa^{J_C + 1}, \qquad J_C = \lfloor \log(\mu_C)/\log(\kappa) \rfloor.$$

Thus, we can expand out (47) as


Jc 


= ^rl^K^ {fJ-c - 

t=i 


2 

- rtrK, 


^2^2\Jc + l _ ^ 2^2 


= Me 


= (rW 


{rlK^y 


r%K — 1 




Jc+i /mck 1 


\ r^K — 1 




MC 


r^K — 1 r*K^ — 1 


<BlArW) 


;(k- 1) 


2 2\"^C + 1 _ H n2 _ _ 

’ (r 2 /c 2 _i)(^ 2 ^_i)’ 


where the inequality on the third line holds whenever $C$ (and thus also $\mu_C$ and
$J_C$) are large enough. Thus, we conclude that


Jc 


^ ^ log^yc^TT/i^ 

“ log (r^-K) 


We can also bound the risk in (46) by 



_ ,.2(Jc+i) _ ^2 ^ - tIk 

J,2 _ 1 K^c + i [r'^K — 1) 

> M+1) - 1 ) _ 

{rl - 1 ) (r^K - 1 ) r2 - 1 ■ 

Plugging in our previous bound for Jc, we find that 


Rl > exp 


> exp 


log 


log 


R(7. K 


2 log (r^) 


C 




log (r^) + log (k) 
2 log (r„) 


. (k- 1) 


log (r^) +log(«;) 


(r2 - 1) {rlK - 1) r2 - 1 

rl (k - 1 ) 


{rl - 1 ) {rln - 1 ) ’ 


where the last inequality again holds for large enough C. Once paired with the
5/4 bound of Donoho et al. [1990] for Pinsker’s constant, this bound yields the
desired result.

Proof of Lemma 8

By the same argument as in the proof of Lemma 9, we can verify that


lim lim E 

M—>00 n—>00 



Im ^E[Zj -7j]" 

77,—VOO < ^ 




II + H 7|- 

j=i j=p+i 


The above expression is largest if the signal concentrates in the $(p+1)$-st Hermite
coefficient, i.e., $\tau(\mu) = \gamma_{p+1} H_{p+1}(\mu/\sigma)$, yielding

limsup I sup I = 

ri - 1 

Plugging in the choice for $p$ specified in (35) and assuming that $C$ is large enough
that $\lceil \log(C/\sigma)/\log(r_\sigma \kappa) \rceil - 1 \ge 2$, we get that


lim sup 

71—>-00 





- 2 { 

V 


\oS(C/a) \ 
log ro-« / 




which is what we set out to show. 
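
The trade-off behind this proof can be checked numerically. The sketch below makes several assumptions that should be read as such: the bracket in (35) is taken to be a ceiling, the ellipse is taken to be $\sum_j \kappa^{2j}\gamma_j^2 \le C^2$, and the worst-case risk is taken to be the variance $\sum_{j \le p} r_\sigma^{2j}$ of the first $p$ coefficients plus the largest tail bias $C^2 \kappa^{-2(p+1)}$. Under these readings the risk stays below $\beta_{\sigma,\kappa}\, C^{2\log(r_\sigma)/(\log(r_\sigma)+\log(\kappa))}$ with $\beta_{\sigma,\kappa} = (1 + r_\sigma^2)\,\sigma^{2\log(\kappa)/(\log(r_\sigma)+\log(\kappa))}$. All function names are hypothetical.

```python
import math


def r_sigma(sigma):
    """r_sigma, with r_sigma^2 = (1 + sigma^2) / sigma^2 as in (34)."""
    return math.sqrt((1.0 + sigma ** 2) / sigma ** 2)


def model_order(C, sigma, kappa):
    """p from (35), reading the bracket as a ceiling (an assumption)."""
    r = r_sigma(sigma)
    return max(2, math.ceil(math.log(C / sigma) / math.log(r * kappa)) - 1)


def worst_case_risk(C, sigma, kappa):
    """Variance of the first p coefficients plus the largest possible tail bias."""
    r2 = r_sigma(sigma) ** 2
    p = model_order(C, sigma, kappa)
    variance = sum(r2 ** j for j in range(1, p + 1))
    bias = C ** 2 * kappa ** (-2 * (p + 1))  # assumes sum_j kappa^(2j) gamma_j^2 <= C^2
    return variance + bias


def upper_bound(C, sigma, kappa):
    """beta * C^e, with beta and the exponent e as described in the lead-in."""
    r = r_sigma(sigma)
    denom = math.log(r) + math.log(kappa)
    beta = (1.0 + r ** 2) * sigma ** (2.0 * math.log(kappa) / denom)
    return beta * C ** (2.0 * math.log(r) / denom)


for C in (100.0, 1000.0, 1e4):
    print(C, model_order(C, 1.0, 2.0),
          worst_case_risk(C, 1.0, 2.0), upper_bound(C, 1.0, 2.0))
```

At $\sigma = 1$ and $\kappa = 2$ the constant in the bound equals 3, matching the value of $\beta_{\sigma,\kappa}$ quoted after Theorem 6.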


Proof of Lemma 9 

Our proof proceeds in several parts. We begin by establishing a version of 
our result for a simpler finite-dimensional problem; the general statement then 
shows that the finite-dimensional case can uniformly approximate our problem 
of interest. 


A Finite-Dimensional Model For some $J \in \mathbb{N}$, suppose that $\tau$ is known to
lie in an ellipse $\mathcal{A}_{\kappa,C}(J)$ defined by


• (m) = ^ 7 , H, (^) , 7 6 (C, J) := I y : ^ y^7| < , 


i=i 


i=i 


and that we only consider estimators over the set AJJ (J) defined by 


(/i)=5f (yexp 


(^) -V'n ( 7 ) 


i=i 


i=i 
Our first task is to show that the minimax risk over this finite-dimensional
parametric class can, for large $M$, be well-approximated by the minimax risk of the
analogous finite Gaussian problem. We recall that, as usual, $g_0(\mu)$ denotes the
Gaussian density $g_\sigma(\mu)$ that has been “wrapped around” the interval $\Omega_M$ =
[-M, M].


Convergence of the Likelihood Gonsider any parameter 7 ' whose induced 
tilting function satisfies '^K,c (J), and denote the resulting marginal den¬ 
sity function by oc * g^ Because the basis functions Hj are all 

bounded (recall that we assume a compact domain [—M, M]), the log-likelihood 
(x) is uniformly Lipschitz in 7 ' for all x. Thus, we can use standard em¬ 
pirical process theory results to verify that the log-likelihood at 7 ', namely 
log(n”=i i® entirely determined by the score at the optimum 7 [e.g., 

van der Vaart, 2000, Lemma 19.31]: 


sup 


t^iGA 


re, C 


iJ) 


Vnhg 


, nr = i4 " 


n 

(y-^).^viog (4”) (X,)) 


2=1 


0 . 


As 7 is the true optimal parameter, we know that E..^ V log (4"^ (-^i)) = 0; 

1 -1 . -ir. 1 . . f An) / rere \M.. , . 1 L V . _/ J_ 


meanwhile, nVar, 


’log (/4\(^.)) 


converges to the Fisher information at 


7 = 0. Thus, by the central limit theorem, we can verify that 


^ V log (4") (X,)) ^ N (0, Vm, j) , Vm, j := Varp [V log (/p (X,))]; 

i=l 


Thus, we conclude that the log-likelihood in favor of 7 ' relative to 7 is asymptot¬ 
ically equivalent in distribution to the log-likelihood arising from the Gaussian 
experiment where we observe Zm ~ Af ( 7 , Vm, j) and want to recover 7 G £5 J)- 


Convergence of the Loss Similarly, we can verify that the density estima¬ 
tion loss at 7 ' given the true parameter value 7 satisfies 





and so 

^iIm 

= ( 7 ' - 7 )^ Wm, j ( 7 ' - 7 ), 

with Wm, j ~ (“) “ V'm (0)) (a) dg., 
where is a vector obtained by stacking the first J Hermite functions.

Moreover, this convergence is uniform in 7 '. Thus, for large n, our re-scaled 
deviance loss is asymptotically equal to 

Lm, j in') ■= (7' - 7)^ Wm, j (7' - 7) ■ 

Convergence of the Minimax Risk Given the convergence results for the 
log-likelihood and for the loss established above, we should expect the statistical 
problem of density estimation to be asymptotically equivalent to finding the 
mean of a Gaussian vector with variance Vm, j under loss Lm. j- To establish this 
formally, we can use least-favorable priors. Let 7 ^^ j be a minimax estimator 
for 7 in the Gaussian model with variance Vm, j under loss Lm, j, subject to 7 S 
£2 (C, J). By checking the conditions of Wald [1945], we can verify that 7 ]]^ j 
is unique and that it is also Bayes for a least-favorable prior j; moreover, 
the risk of 7 ^ j is constant over the support of t:*m j- Finally, because Lm, j is 
quadratic, ‘^m j is the posterior mean for this least-favorable prior. 

Now, let 7 ^’J ^ be the posterior mean for 7 in the density estimation prob¬ 
lem with n samples, where 7 has a prior j. By our previous results, we find 
that 

C {l^ (5^5^ J) ^C{LM,j{rM,j)) , 

and that the risk of this estimator is asymptotically constant over the support 
of TT^ j] thus, this estimator is asymptotically minimax over the support of 
TT^ J. Finally, the risk of this estimator is never asymptotically worse outside 
the support of j, and so Jm'j ^ asymptotically minimax for our 

finite-dimensional density estimation problem. 


Taking Limits To move from the compact interval $\Omega_M$ to the real line we
observe that, for any fixed $J$, we know that the Hermite functions are orthogonal
over the whole real line:



X J) 


and so $\lim_{M\to\infty} W_{M,J} = I_{J \times J}$. Moreover, by the same argument as used in the
proof of Theorem 5, we find that $V_{M,J}$ converges to the leading $J \times J$ sub-matrix
of the covariance matrix defined in (43). Meanwhile, for a fixed $M$, we can use
the uniform approximability result from Corollary 4 to verify that the minimax
risk of density estimation converges as $J \to \infty$; a similar argument holds in
the Gaussian case. Because the minimax risks for both problems match for
every finite $J$, they must thus also match as $J \to \infty$. Now, because we have
established convergence both when $J$ goes to infinity given a fixed $M$ and when
$M$ goes to infinity given a fixed $J$, we conclude that the joint limit $M, J \to \infty$
is well-defined and does not depend on the order in which we take the limits;
this implies the desired result.

Proof of Lemma 10

Asymptotic normality follows directly from the central limit theorem; we only
need to verify the moments. Now, using similar arguments as in the proof of
Corollary 4, we can verify that the limiting covariance of the $Z_j^{(n)}$ does not
depend on $\tau$, and that


lim Cov 

n—^oo 


= p 


zf\ / U,{x)Uy[x)h{x) dx 

JQ 

-2 f K{x, p,i)Tj{pi)goipi) K{x, p2)Tj,{p2)go{fi2) 




foix) 


dpi dp 2 dx 


= Pp / 0+1 (Mi) Pgo (Mi. P 2 ) O'+i (M 2 ) dpi dp2 


~ ^{j=j'} Pj > 


where $P_{g_0}$ is the linear operator defined in (21) and the $\zeta_j$ are its eigenvectors
as defined in Theorem 3; recall that $\rho_j = \lambda_{j+1}(P_{g_0})$. Meanwhile, the $T_j$ are
centered such that $\mathbb{E}[T_j(\mu)] = 0$, and so


\/n 


Z. 


(n) 


= / Ujix)foix) dx 
Jn 


= Kix,p) Tj{p)go{p) dpdx = 0. 
Jfl JQ 


Thus, we can verify that 


lim E 

n—>-oo 


= Pi 


Z) 


np-^ 

n—¥oo 


= lim 


Ujix) fr/V^{x) dx 


Uj{x) 




dx 


J £=0 


= Pn 


-1 


r K {x, pi)Tg ipi)go jpi) K {x, P 2 ) (M 2 )],,^o 

/aa fo (x) 


dpi dp 2 dx 


= Pp / Tg [pi) i/go (pi) Pgg Pgo (M2) t (^2) dpi dp 2 


= / Tj (m) r (m) 9o (m) dp, 


JQ 

where the last equality follows from the spectral theorem because $\sqrt{g_0}\, T_j$ is an
eigenfunction of $P_{g_0}$ with eigenvalue $\rho_j$.